NSF PAR Search | NSF Public Access Repository

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

https://doi.org/10.1007/978-3-031-73404-5_20

Chatterjee, Agneet; Luo, Yiran; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (October 2024, Springer Nature Switzerland)

Full Text Available
On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

https://doi.org/10.1109/CVPR52733.2024.00270

Chatterjee, Agneet; Gokhale, Tejas; Baral, Chitta; Yang, Yezhou (June 2024, IEEE)

Full Text Available
Getting it Right: Improving Spatial Consistency in Text-to-Image Models

https://doi.org/10.1007/978-3-031-72670-5_12

Chatterjee, Agneet; Stan, Gabriela Ben Melech; Aflalo, Estelle; Paul, Sayak; Ghosh, Dhruba; Gokhale, Tejas; Schmidt, Ludwig; Hajishirzi, Hannaneh; Lal, Vasudev; Baral, Chitta; et al. (September 2024, Springer Nature Switzerland)

Full Text Available
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models

https://doi.org/10.1609/aaai.v38i13.29371

Patel, Maitreya; Gokhale, Tejas; Baral, Chitta; Yang, Yezhou (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)

The ability to understand visual concepts and to replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition, realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models to learn and synthesize novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality, which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/
Full Text Available
End-to-end Knowledge Retrieval with Multi-modal Queries

Luo, Man; Fang, Zhiyuan; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (July 2023, 61st Annual Meeting of the Association for Computational Linguistics)

We investigate knowledge retrieval with multi-modal queries, i.e., queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model “ReViz” that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion without depending on intermediate modules such as object detectors or caption generators. We introduce a new pretraining task that is effective for learning knowledge retrieval with multi-modal queries and also improves performance on downstream tasks. We demonstrate superior retrieval performance on two datasets (ReMuQ and OK-VQA) under zero-shot settings, as well as further improvements when finetuned on these datasets.
Full Text Available
Covariate Shift Detection via Domain Interpolation Sensitivity

Gokhale, Tejas (July 2022, First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022)

Covariate shift is a major roadblock in the reliability of image classifiers in the real world. Work on covariate shift has focused on training classifiers to adapt or generalize to unseen domains. However, for transparent decision making, it is equally desirable to develop covariate shift detection methods that can indicate whether or not a test image belongs to an unseen domain. In this paper, we introduce a benchmark for covariate shift detection (CSD) that builds upon and complements previous work on domain generalization. We use state-of-the-art OOD detection methods as baselines and find them to be worse than simple confidence-based methods on our CSD benchmark. We propose an interpolation-based technique, Domain Interpolation Sensitivity (DIS), based on the simple hypothesis that interpolation between the test input and randomly sampled inputs from the training domain offers sufficient information to distinguish between the training domain and unseen domains under covariate shift. DIS surpasses all OOD detection baselines for CSD on multiple domain generalization benchmarks.
Full Text Available
Improving Diversity with Adversarially Learned Transformations for Domain Generalization

https://doi.org/10.1109/WACV56688.2023.00051

Gokhale, Tejas; Anirudh, Rushil; Thiagarajan, Jayaraman J.; Kailkhura, Bhavya; Baral, Chitta; Yang, Yezhou (January 2023, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))

Full Text Available
Semantically Distributed Robust Optimization for Vision-and-Language Inference

https://doi.org/10.18653/v1/2022.findings-acl.118

Gokhale, Tejas; Chaudhary, Abhishek; Banerjee, Pratyay; Baral, Chitta; Yang, Yezhou (January 2022, ACL 2022 Findings)

Full Text Available
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering

https://doi.org/10.1109/ICCV48922.2021.00192

Banerjee, Pratyay; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (October 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))

Full Text Available
WeaQA: Weak Supervision via Captions for Visual Question Answering

https://doi.org/10.18653/v1/2021.findings-acl.302

Banerjee, Pratyay; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (January 2021, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021)
Zong, Chengqing; Xia, Fei; Li, Wenjie; Navigli, Roberto (Ed.)
Methodologies for training visual question answering (VQA) models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets. This has led to heavy reliance on datasets and a lack of generalization to new types of questions and scenes. Linguistic priors along with biases and errors due to annotator subjectivity have been shown to percolate into VQA models trained on such samples. We study whether models can be trained without any human-annotated Q-A pairs, but only with images and their associated textual descriptions or captions. We present a method to train models with synthetic Q-A pairs generated procedurally from captions. Additionally, we demonstrate the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models. Our experiments on three VQA benchmarks demonstrate the efficacy of this weakly-supervised approach, especially on the VQA-CP challenge, which tests performance under changing linguistic priors.
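The abstract above mentions spatial-pyramid image patches without giving details. As a rough illustration only, the general idea of a spatial pyramid can be sketched as tiling an image with grids of increasing resolution. The function name `spatial_pyramid_boxes` and the grid sizes (1, 2, 3) are assumptions made for this sketch and are not taken from the paper.

```python
def spatial_pyramid_boxes(height, width, grid_sizes=(1, 2, 3)):
    """Return (top, left, bottom, right) boxes for a g x g grid at each level.

    Hypothetical helper: the grid sizes are illustrative, not the paper's.
    """
    boxes = []
    for g in grid_sizes:
        # Integer cell boundaries; the last boundary equals the image size,
        # so the cells at each level tile the image without gaps.
        rows = [height * k // g for k in range(g + 1)]
        cols = [width * k // g for k in range(g + 1)]
        for i in range(g):
            for j in range(g):
                boxes.append((rows[i], cols[j], rows[i + 1], cols[j + 1]))
    return boxes


boxes = spatial_pyramid_boxes(224, 224)
print(len(boxes))  # 1 + 4 + 9 = 14 patches
```

Each box could then be cropped and encoded as one visual token, so no object detector or bounding box annotation is involved in producing the regions.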
Full Text Available